Fast Pattern Matching for Entropy Bounded Text

نویسندگان

  • Shenfeng Chen
  • John H. Reif
چکیده

We present the rst known case of one-dimensional and two-dimensional string matching algorithms for text with bounded entropy. Let n be the length of the text and m be the length of the pattern. We show that the expected complexity of the algorithms is related to the entropy of the text for various assumptions of the distribution of the pattern. For the case of uniformly distributed patterns, our one dimensional matching algorithm works in O(n log m=pm)) expected running time where H is the entropy of the text and p = 1 ? (1 ? H 2) H=(1+H). The worst case running time T can also be bounded by n log m p(m+ p V) T n log m p(m? p V) if V is the variance of the source from which the pattern is generated. Our algorithm utilizes data structures and probabilistic analysis techniques that are found in certain lossless data compression schemes.

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Entropy-Compressed Indexes for Multidimensional Pattern Matching

In this talk, we will discuss the challenges involved in developing a multidimensional generalizations of compressed text indexing structures. These structures depend on some notion of Burrows-Wheeler transform (BWT) for multiple dimensions, though naive generalizations do not enable multidimensional pattern matching. We study the 2D case to possibly highlight combinatorial properties that do n...

متن کامل

Fast Least Square Matching

Least square matching (LSM) is one of the most accurate image matching methods in photogrammetry and remote sensing. The main disadvantage of the LSM is its high computational complexity due to large size of observation equations. To address this problem, in this paper a novel method, called fast least square matching (FLSM) is being presented. The main idea of the proposed FLSM is decreasing t...

متن کامل

Fast and Simple Character Classes and Bounded Gaps Pattern Matching, with Applications to Protein Searching

The problem of fast exact and approximate searching for a pattern that contains classes of characters and bounded size gaps (CBG) in a text has a wide range of applications, among which a very important one is protein pattern matching (for instance, one PROSITE protein site is associated with the CBG [RK] - x(2,3) - [DE] - x(2,3) - Y, where the brackets match any of the letters inside, and x(2,...

متن کامل

Simple Compression Code Supporting Random Access and Fast String Matching

Given a sequence S of n symbols over some alphabet Σ, we develop a new compression method that is (i) very simple to implement; (ii) provides O(1) time random access to any symbol of the original sequence; (iii) allows efficient pattern matching over the compressed sequence. Our simplest solution uses at most 2h + o(h) bits of space, where h = n(H0(S) + 1), and H0(S) is the zeroth-order empiric...

متن کامل

Very fast pattern matching for highly repetitive text

This paper describes two searching methods for locating longest string matches in source texts of low entropy. A modi cation of the Boyer-Moore scanning algorithm and a statistical method, which searches for less likely symbols, are presented. Both algorithms have been implemented as part of the searching strategy for an LZ77 type encoder. Experimental results are included.

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 1995